The strategy-stealing assumption
Suppose that 1% of the world’s resources are controlled by unaligned AI, and 99% of the world’s resources are controlled by humans. We might hope that at least 99% of the universe’s resources end up being used for stuff-humans-like (in expectation).
Jessica Taylor argued for this conclusion in Strategies for Coalitions in Unit-Sum Games: if the humans divide into 99 groups, each of which acquires influence as effectively as the unaligned AI, then by symmetry each group should end up with as much influence as the AI, so collectively the 99 groups should end up with 99% of the influence.
This argument rests on what I’ll call the strategy-stealing assumption: for any strategy an unaligned AI could use to influence the long-run future, there is an analogous strategy that a similarly-sized group of humans can use in order to capture a similar amount of flexible influence over the future. By “flexible” I mean that humans can decide later what to do with that influence — which is important since humans don’t yet know what we want in the long run.
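As a minimal toy sketch of Taylor's symmetry argument under this assumption (the unit-sum game below is an illustrative invention of mine, not the actual model from her post):

```python
# Toy unit-sum influence game: total influence sums to 1, and each coalition's
# share is proportional to how effectively it converts its resources into
# influence. The functions and numbers here are illustrative assumptions.

def influence_shares(resources, effectiveness):
    """Each coalition's influence share is proportional to resources * effectiveness."""
    scores = [r * e for r, e in zip(resources, effectiveness)]
    total = sum(scores)
    return [s / total for s in scores]

# One unaligned AI with 1% of resources, plus 99 human groups with 1% each.
resources = [0.01] * 100
# Strategy-stealing: every human group copies the AI's strategy, so all
# coalitions are equally effective at converting resources into influence.
effectiveness = [1.0] * 100

shares = influence_shares(resources, effectiveness)
print(f"unaligned AI share:  {shares[0]:.2%}")        # 1.00%
print(f"human groups total:  {sum(shares[1:]):.2%}")  # 99.00%
```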
Why might the strategy-stealing assumption be true?
Today there are a bunch of humans, with different preferences and different kinds of influence. Crudely speaking, the long-term outcome seems to be determined by some combination of {which preferences have how much influence?} and {what is the space of realizable outcomes?}.
I expect this to become more true over time — I expect groups of agents with diverse preferences to eventually approach efficient outcomes, since otherwise there are changes that every agent would prefer (though this is not obvious, especially in light of bargaining failures). Then the question is just about which of these efficient outcomes we pick.
I think that our actions don’t affect the space of realizable outcomes, because long-term realizability is mostly determined by facts about distant stars that we can’t yet influence. The obvious exception is that if we colonize space faster, we will have access to more resources. But quantitatively this doesn’t seem like a big consideration, because astronomical events occur over millions of millennia while our decisions only change colonization timelines by decades.
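As a crude illustration with invented numbers: if the relevant astronomical timescales are on the order of a billion years and our choices shift colonization timelines by a few decades, the fraction of long-run resources at stake is roughly

\[
\frac{\text{delay}}{\text{astronomical timescale}} \approx \frac{30\ \text{years}}{10^{9}\ \text{years}} = 3 \times 10^{-8}.
\]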
So I think our decisions mostly affect long-term outcomes by changing the relative weights of different possible preferences (or by causing extinction).
Today, one of the main ways that preferences have weight is because agents with those preferences control resources and other forms of influence. Strategy-stealing seems most possible for this kind of plan — an aligned AI can exactly copy the strategy of an unaligned AI, except the money goes into the aligned AI’s bank account instead. The same seems true for most kinds of resource gathering.
There are lots of strategies that give influence to other people instead of helping me. For example, I might preferentially collaborate with people who share my values. But I can still steal these strategies, as long as my values are just as common as the values of the person I’m trying to steal from. So a majority can steal strategies from a minority, but not the other way around.
There can be plenty of strategies that don’t involve acquiring resources or flexible influence. For example, we could have a parliament with obscure rules in which I can make maneuvers that advantage one set of values or another in a way that can’t be stolen. Strategy-stealing may only be possible at the level of groups — you need to retain the option of setting up a different parliamentary system that doesn’t favor particular values. Even then, it’s unclear whether strategy-stealing is possible.
There isn’t a clean argument for strategy-stealing, but I think it seems plausible enough that it’s meaningful and productive to think of it as a plausible default, and to look at ways it can fail. (If you found enough ways it could fail, you might eventually stop thinking of it as a default.)
Eleven ways the strategy-stealing assumption could fail
In this section I’ll describe some of the failures that seem most important to me, with a focus on the ones that would interfere with the argument in the introduction.
1. AI alignment
If we can build smart AIs, but not aligned AIs, then humans can’t necessarily use AI to capture flexible influence. I think this is the most important way in which strategy-stealing is likely to fail. I’m not going to spend much time talking about it here because I’ve spent so much time elsewhere.
For example, if smart AIs inevitably want to fill the universe with paperclips, then “build a really smart AI” is a good strategy for someone who wants to fill the universe with paperclips, but it can’t be easily stolen by someone who wants anything else.
2. Value drift over generations
The values of 21st century humans are determined by some complicated mix of human nature and the modern environment. If I’m a 16th century noble who has really specific preferences about the future, it’s not really clear how I can act on those values. But if I’m a 16th century noble who thinks that future generations will inevitably be wiser and should get what they want, then I’m in luck, all I need to do is wait and make sure our civilization doesn’t do anything rash. And if I have some kind of crude intermediate preferences, then I might be able to push our culture in appropriate directions or encourage people with similar genetic dispositions to have more kids.
This is the most obvious and important way that strategy-stealing has failed historically. It’s not something I personally worry about too much though.
The big reason I don’t worry is some combination of common-sense morality and decision theory: our values are the product of many generations each giving way to the next one, and so I’m pretty inclined to “pay it forward.” Put a different way, I think it’s relatively clear I should empathize with the next generation since I might well have been in their place (whereas I find it much less clear under what conditions I should empathize with AI). Or from yet another perspective, the same intuition that I’m “more right” than previous generations makes me very open to the possibility that future generations are more right still. This question gets very complex, but my first-pass take is that I’m maybe an order of magnitude less worried about this than about other kinds of value drift.
The small reason I don’t worry is that I think this dynamic is probably going to be less important in the future (unless we actively want it to be important — which seems quite possible). I believe there is a good chance that within 60 years most decisions will be made by machines, and so the handover from one generation to the next will be optional.
That all said, I am somewhat worried about more “out of distribution” changes to the values of future generations, in scenarios where AI development is slower than I expect. For example, I think it’s possible that genetic engineering of humans will substantially change what we want, and that I should be less excited about that kind of drift. Or I can imagine the interaction between technology and culture causing similarly alien changes. These questions are even harder to think about than the basic question of “how much should I empathize with future generations?” which already seemed quite thorny, and I don’t really know what I’d conclude if I spent a long time thinking. But at any rate, these things are not at the top of my priority queue.
3. Other alignment problems
AIs and future generations aren’t the only optimizers around. For example, we can also build institutions that further their own agendas. We can then face a problem analogous to AI alignment — if it’s easier to build effective institutions with some kinds of values than others, then those values could be at a structural advantage. For example, we might inevitably end up with a society that optimizes generalizations of short-term metrics, if big groups of humans are much more effective when doing this. (I say “generalizations of short-term metrics” because an exclusive focus on short-term metrics is the kind of problem that can fix itself over the very long run.)
I think that institutions are currently considerably weaker than humans (in the sense that’s relevant to strategy-stealing) and this will probably remain true over the medium term. For example:
A company with 10,000 people might be much smarter than any individual human, but mostly that’s because of its alliance with its employees and shareholders — most of its influence is just used to accumulate more wages and dividends. Companies do things that seem antisocial not because they have come unmoored from any human’s values, but because plenty of influential humans want them to do that in order to make more money. (You could try to point to the “market” as an organization with its own preferences, but it’s even worse at defending itself than bureaucracies — it’s up to humans who benefit from the market to defend it.)
Bureaucracies can seem unmoored from any individual human desire. But their actual ability to defend themselves and acquire resources seems much weaker than other optimizers like humans or corporations.
Overall I’m less concerned about this than AI alignment, but I do think it is a real problem. I’m somewhat optimistic that the same general principles will be relevant both to aligning institutions and AIs. If AI alignment wasn’t an issue, I’d be more concerned by problems like institutional alignment.
4. Human fragility
If AI systems are aligned with humans, they may want to keep humans alive. Not only do humans prefer being alive, humans may need to survive if they want to have the time and space to figure out what they really want and to tell their AI what to do. (I say “may” because at some point you might imagine e.g. putting some humans in cold storage, to be revived later.)
This could introduce an asymmetry: an AI that just cares about paperclips can get a leg up on humans by threatening to release an engineered plague, or trashing natural ecosystems that humans rely on. (Of course, this asymmetry may also go the other way — values implemented in machines are reliant on a bunch of complex infrastructure which may be more or less of a liability than humanity’s reliance on ecosystems.)
Stepping back, I think the fundamental long-term problem here is that “do what this human wants” is only a simple description of human values if you actually have the human in hand, and so an agent with these values does have a big extra liability.
I do think that the extreme option of “storing” humans to revive them later is workable, though most people would be very unhappy with a world where that becomes necessary. (To be clear, I think it almost certainly won’t.) We’ll return to this under “short-term terminal preferences” below.
5. Persuasion as fragility
If an aligned AI defines its values with reference to “whatever Paul wants,” then someone doesn’t need to kill Paul to mess with the AI, they just need to change what Paul wants. If it’s very easy to manipulate humans, but we want to keep talking with each other and interacting with the world despite the risk, then this extra attack surface could become a huge liability.
This is easier to defend against — just stop talking with people except in extremely controlled environments where you can minimize the risk of manipulation — but again humans may not be willing to pay that cost.
The main reason this might be worse than point 4 is that humans may be relatively happy to physically isolate themselves from anything scary, but it would be much more costly for us to cut ourselves off from contact with other humans.
6. Asymmetric persuasion
Even if humans are the only optimizers around, it might be easier to persuade humans of some things than others. For example, you could imagine a world where it’s easier to convince humans to endorse a simple ideology like “maximize the complexity of the universe” than to convince humans to pursue some more complex and subtle values.
This means that people whose values are easy to persuade others of can use persuasion as a strategy, while people with other values cannot copy it.
I think this is ultimately more important than fragility, because it is relevant before we have powerful AI systems. It has many similarities to “value drift over generations,” and I have some mixed feelings here as well — there are some kinds of argument and deliberation that I certainly do endorse, and to the extent that my current views are the product of significant amounts of non-endorsed deliberation I am more inclined to be empathetic to future people who are influenced by increasingly-sophisticated arguments.
But as I described in section 2, I think these connections can get weaker as technological progress moves us further out of distribution, and if you told me that e.g. it was possible to perform a brute force search and find an argument that could convince someone to maximize the complexity of the future, I wouldn’t conclude that it’s probably fine if they decided to do that.
(Credit to Wei Dai for emphasizing this failure mode.)
7. Value-sensitive bargaining
If a bunch of powerful agents collectively decide what to do with the universe, I think it probably won’t look like “they all control their own slice of the universe and make independent decisions about what to do.” There will likely be opportunities for trade, they may have meddling preferences (where I care what you do with your part of the universe), there may be a possibility of destructive conflict, or it may look completely different in an unanticipated way.
In many of these settings the outcome is influenced by a complicated bargaining game, and it’s unclear whether the majority can steal a minority’s strategy. For example, suppose that there are two values X and Y in the world, with 99% X-agents and 1% Y-agents. The Y-agents may be able to threaten to destroy the world unless there is an even split, and the X-agents have no way to copy such a strategy. (This could also occur over the short term.)
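A minimal numerical sketch of why this threat is hard to steal (the payoffs and the credibility parameter are invented for illustration, not derived from any particular bargaining theory):

```python
# Toy model of the X vs. Y bargaining game. Illustrative assumptions: X-agents
# hold 99% of resources, Y-agents hold 1%, and the Y-agents threaten to
# destroy everything (worth 0 to both sides) unless given an even split.

def x_expected_share(credibility):
    """Expected X share under each response, if a refused threat is carried
    out with probability `credibility`."""
    return {
        "concede": 0.50,                     # accept the even split
        "refuse": (1 - credibility) * 0.99,  # keep 99% unless the world is destroyed
    }

for p in (0.1, 0.5, 0.9):
    shares = x_expected_share(p)
    best = max(shares, key=shares.get)
    print(f"threat credibility {p:.0%}: concede -> {shares['concede']:.2f}, "
          f"refuse -> {shares['refuse']:.2f}, X's best response: {best}")

# The X-agents cannot steal this strategy: threatening to destroy the world is
# only credible for agents who lose little by carrying the threat out.
```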
I don’t have a strong view about the severity of this problem. I could imagine it being a big deal.
8. Recklessness
Some preferences might not care about whether the world is destroyed, and therefore have access to productive but risky strategies that more cautious agents cannot copy. The same could happen with other kinds of risks, like commitments that are game-theoretically useful but risk sacrificing some part of the universe or creating long-term negative outcomes.
I tend to think about this problem in the context of particular technologies that pose an extinction risk, but it’s worth keeping in mind that it can be compounded by the existence of more reckless agents.
Overall I think this isn’t a big deal, because it seems much easier to cause extinction by trying to kill everyone than as an accident. There are fewer people who are in fact trying to kill everyone, but I think not enough fewer to tip the balance. (This is a contingent fact about technology though; it could change in the future and I could easily be wrong even today.)
9. Short-term unity and coordination
Some actors may have long-term values that are easier to talk about, represent formally, or reason about. Relative to humans, AIs may be especially likely to have such values. These actors could have an easier time coordinating, e.g. by pursuing some explicit compromise between their values (rather than being forced to find a governance mechanism for some resources produced by a joint venture).
This could leave us in a place where e.g. an unaligned AI controls 1% of resources, but the majority of resources are controlled by humans who want to acquire flexible resources. Then the unaligned AIs can form a single coalition that achieves very high efficiency, while the humans cannot form 99 competing coalitions of comparable effectiveness.
This could theoretically be a problem without AI, e.g. a large group of humans with shared explicit values might be able to coordinate better and so leave normal humans at a disadvantage, though I think this is relatively unlikely as a major force in the world.
The seriousness of this problem is bounded by both the efficiency gains for a large coalition, and the quality of governance mechanisms for different actors who want to acquire flexible resources. I think we have OK solutions for coordination between people who want flexible influence, such that I don’t think this will be a big problem:
The humans can participate in lotteries to concentrate influence (see the sketch after this list). Or you can gather resources to be used for a lottery in the future, while still allowing time for people to become wiser and then make bargains about what to do with the universe before they know who wins.
You can divide up the resources produced by a coalition equitably (and then negotiate about what to do with them).
You can modify other mechanisms by allowing votes that could e.g. overrule certain uses of resources. You could have more complex governance mechanisms, can delegate different kinds of authority to different systems, can rely on trusted parties, etc.
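A minimal sketch of the lottery mechanism from the first item above (the participants, stakes, and seed are hypothetical; the point is only that a lottery concentrates influence without changing anyone's expected share):

```python
import random

# Toy influence lottery: each participant stakes their resources, one winner is
# drawn with probability proportional to stake, and the winner directs the pool.

def run_lottery(stakes, rng):
    """Return the winner, drawn with probability proportional to stake."""
    names = list(stakes)
    weights = [stakes[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

stakes = {"alice": 1.0, "bob": 1.0, "carol": 2.0}
pool = sum(stakes.values())
rng = random.Random(0)

# In expectation each participant directs (stake / pool) of the pool, i.e.
# exactly their original share of influence.
wins = {name: 0 for name in stakes}
trials = 100_000
for _ in range(trials):
    wins[run_lottery(stakes, rng)] += 1
for name in stakes:
    print(f"{name}: share {stakes[name] / pool:.2f}, observed win rate {wins[name] / trials:.2f}")
```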
Many of these procedures work much better amongst groups of humans who expect to have relatively similar preferences or have a reasonable level of trust for other participants to do something basically cooperative and friendly (rather than e.g. demanding concessions so that they don’t do something terrible with their share of the universe or if they win the eventual lottery).
(Credit to Wei Dai for describing and emphasizing this failure mode.)
10. Weird stuff with simulations
I think civilizations like ours mostly have an impact via the common-sense channel where we ultimately colonize space. But there may be many civilizations like ours in simulations of various kinds, and influencing the results of those simulations could also be an important part of what we do. In that case, I don’t have any particular reason to think strategy-stealing breaks down, but I think stuff could be very weird and I have only a weak sense of how this influences optimal strategies.
Overall I don’t think much about this since it doesn’t seem likely to be a large part of our influence and it doesn’t break strategy-stealing in an obvious way. But I think it’s worth having in mind.
11. Other preferences
People care about lots of stuff other than their influence over the long-term future. If 1% of the world is unaligned AI and 99% of the world is humans, but the AI spends all of its resources on influencing the future while the humans spend only a tenth of theirs, it wouldn’t be too surprising if the AI ended up with 10% of the influence rather than 1%. This can matter in lots of ways other than literal spending and saving: someone who only cared about the future might make different tradeoffs, might be willing to defend themselves at the cost of short-term value (see sections 4 and 5 above), might pursue more ruthless strategies for expansion, and so on.
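As a back-of-the-envelope version of that figure, treating long-run influence as simply proportional to resources spent on it (an assumption, not something argued for above):

\[
\text{AI share} \approx \frac{0.01 \times 1.0}{0.01 \times 1.0 + 0.99 \times 0.1} = \frac{0.01}{0.109} \approx 9\%.
\]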
I think the simplest approximation is to restrict attention to the part of our preferences that is about the long-term (I discussed this a bit in Why might the future be good?). To the extent that someone cares about the long-term less than the average actor, they will represent a smaller fraction of this “long-term preferences” mixture. This may give unaligned AI systems a one-time advantage for influencing the long-term future (if they care more about it) but doesn’t change the basic dynamics of strategy-stealing. Even this advantage might be clawed back by a majority (e.g. by taxing savers).
There are a few places where this picture seems a little bit less crisp:
Rather than being able to spend resources on either the short or long-term, sometimes you might have preferences about how you acquire resources in the short-term; an agent without such scruples could potentially pull ahead. If these preferences are strong, it probably violates strategy-stealing unless the majority can agree to crush anyone unscrupulous.
For humans in particular, it may be hard to separate out “humans as repository of values” from “humans as an object of preferences,” and this may make it harder for us to defend ourselves (as discussed in sections 4 and 5).
I mostly think these complexities won’t be a big deal quantitatively, because I think our short-term preferences will mostly be compatible with defense and resource acquisition. But I’m not confident about that.
Conclusion
I think strategy-stealing isn’t really true, but it’s a good enough approximation that we can basically act as if it were, and then think about the risks posed by possible failures of strategy-stealing.
I think this is especially important for thinking about AI alignment, because it lets us formalize the lowered goalposts I discussed here: we just want to ensure that AI is compatible with strategy-stealing. These lowered goalposts are an important part of why I think we can solve alignment.
In practice I think that a large coalition of humans isn’t reduced to strategy-stealing — a majority can simply stop a minority from doing something bad, rather than copying its strategy. The possible failures in this post could potentially be addressed by either a technical solution or some kind of coordination.